Crowdsourced evaluation is a promising method of evaluating
attributes of a design that require human input, such as
maintainability  of  a  vehicle.  The  challenge  is  to  correctly  es¬ 
timate  the  design  scores  using  a  massive  and  diverse  crowd, 
particularly  when  only  a  minority  of  evaluators  give  correct 
evaluations.  As  an  alternative  to  simple  averaging,  this 
paper  introduces  a  Bayesian  network  approach  that  models 
the  human  evaluation  process  and  estimates  design  scores, 
taking  human  abilities  in  evaluating  the  design  into  account. 
Simulation  results  indicate  that  the  proposed  method  is 
preferred  to  averaging  since  it  identifies  the  experts  from 
the  crowd,  under  the  assumptions  that  (1)  experts  do  exist 
and  (2)  only  experts  have  consistent  evaluations.  These 
assumptions,  however,  do  not  always  hold  as  indicated  by 
the  results  of  a  human  study.  Clusters  of  consistent  yet 
incorrect  human  evaluators  are  shown  to  exist  along  with 
the  cluster  of  experts.  This  suggests  that  additional  data 
such  as  evaluators’  background  are  needed  to  isolate  the 
correct clusters of experts for design evaluation tasks.
1  Introduction

Suppose  we  wish  to  evaluate  a  set  of  military  vehicle  de¬ 
sign  concepts  with  respect  to  objective  mission  performance 
attributes.  For  many  objective  attributes,  the  “true  score” 
may  be  determined  using  detailed  physics-based  simulations, 
such  as  finite-element  analysis  to  evaluate  blast  resistance  or 
human  mobility  modeling  to  evaluate  ergonomics;  however, 
for  some  objective  attributes,  such  as  situational  awareness, 
physics-based  simulation  is  difficult  or  not  possible  at  all.  In¬ 
stead,  these  objective  attributes  require  human  input  for  ac¬ 
curate  evaluation. 

To  obtain  evaluations  over  these  objective  attributes,  one 
may  ask  a  number  of  specialists  to  evaluate  the  set  of  vehicle 
design  concepts.  This  assumes  the  requisite  ability  is  imbued 
within  this  group  of  specialists.  Oftentimes  though,  the  abil¬ 
ity  to  make  a  comprehensive  evaluation  is  instead  scattered 
over  the  “collective  intelligence”  of  a  much  larger  crowd  of 
people  with  diverse  backgrounds  [1]. 

Crowdsourced  evaluation,  or  the  delegation  of  an  eval¬ 
uation  task  to  a  large  and  unknown  group  of  people,  is 
a  promising  approach  to  obtain  such  design  evaluations. 
Crowdsourced  evaluation  draws  from  the  pioneering  works 
of  online  communities,  like  Wikipedia,  which  have  shown 
that  accuracy  and  comprehensiveness  are  possible  in  a  large 
crowdsourced  setting.  Although  many  successful  online 
communities  exist,  there  are  limited  reference  materials  on 
the  use  of  crowdsourced  evaluation  for  engineering  design. 

In  this  study,  we  explore  how  the  ability  of  evaluators 
in  the  crowd  affects  the  crowdsourced  evaluation  process, 
where  ability  is  defined  as  the  probability  that  a  participant 
gives  an  evaluation  close  to  the  design’s  true  score.  The 
choice  of  exploring  ability  comes  from  an  important  les¬ 
son  from  successful  online  community  efforts,  namely,  the 
need  to  implement  a  systematic  method  of  filtering  “signal” 
from “noise” [2]. In a crowdsourced evaluation process, this
manifests itself as a need to screen good evaluations from
bad evaluations, in particular when we are given a heterogeneous
crowd made up of a mixture of high-ability and low-ability
participants. In this case, averaging evaluations from
all  participants  with  equal  weight  will  reduce  the  accuracy 
of  the  crowd’s  combined  evaluation  due  to  bad  evaluations 
from  low-ability  participants.  Accordingly,  a  desirable  goal 
is  to  identify  the  high-ability  participants  from  the  rest  of  the 
crowd,  as  their  “signal”  will  be  closer  to  the  true  scores  of 
the  designs,  and  their  evaluations  may  be  subsequently  given 
more  weight. 

To  achieve  this  goal,  we  statistically  model  the  crowd¬ 
sourced  evaluation  process  with  a  Bayesian  network  that 
does  not  require  prior  knowledge  of  the  true  scores  of  the 
designs  or  of  the  ability  of  each  evaluator  in  the  crowd,  yet 
still  aims  to  identify  the  high-ability  participants  within  the 
crowd.  This  model  links  the  ability  of  evaluators  in  the  crowd 
(i.e.,  knowledge  or  experience  for  the  design  being  evalu¬ 
ated),  the  evaluation  difficulty  of  each  design  (e.g.,  a  detailed 
3D  model  provides  more  information  than  a  2D  sketch  and 
may  therefore  be  easier  for  an  expert  to  evaluate  accurately), 
and  the  true  score  of  each  of  the  designs.  The  model  rests 
on  the  key  assumption  that  low-ability  evaluators  are  more 
likely  to  “guess,”  and  while  guessing,  to  evaluate  designs 
more randomly. This assumption is modeled by defining an
evaluation to be a random variable centered at the true score of
the  design  being  evaluated  [3].  A  graphical  representation  of 
the  Bayesian  network  showing  these  relationships  is  given  in 
Figure  1. 

The performance of the Bayesian network versus the
baseline method of Averaging was explored through two
studies.  First,  we  created  simulated  crowds  to  generate  eval¬ 
uations  for  a  set  of  designs.  These  crowds  had  a  homo¬ 
geneous  or  heterogeneous  ability  distribution,  representing 
two  cases  that  may  be  found  in  a  human  crowd.  Second, 
we  used  a  human  crowd  recruited  from  the  crowdsourcing 
platform  Amazon’s  Mechanical  Turk  [4],  and  performed  a 
crowdsourced  evaluation  with  the  same  crowd  and  task  prop¬ 
erties  as  in  the  simulation. 

The  remainder  of  this  paper  is  organized  as  follows. 
Section  2  reviews  related  work  from  engineering  design,  psy¬ 
chometrics,  and  crowdsourcing  literature,  as  well  as  research 
motivations  from  industry.  Section  3  presents  the  simulation 
environment  and  modeling  assumptions.  Section  4  details 
the  statistical  inference  scheme  of  the  Bayesian  network. 
Section 5 describes the simulated crowd study and results.
Section 6 describes the human crowd study and discusses its
results. We conclude in Section 7 with limitations of this
work and opportunities for future development.

Fig. 1. Graphical representation of the Bayesian network model.
This model describes a crowd of evaluators making evaluations $r_{pd}$
that have error from the true score $\Phi_d$. Each evaluator has an
ability $a_p$ and each design has a difficulty $d_d$. The gray shading on
the evaluation $r_{pd}$ denotes that it is the only observed data for this
model.

2  Related  Work 

Within the engineering design community, attention is
being drawn to the use of crowdsourcing for better informing
subjective design decisions [5]. Methods using publicly
accessible crowdsourced data from social media sites
have been used for preference learning [6, 7]. More directed
crowdsourced evaluation with online surveys has also been
used for idea evaluation [8], creativity evaluation [9], and
aesthetic preference learning [10]. Our work differs from
these works in that we focus on an objective task, thus
necessitating the estimation of evaluator ability.

Much  literature  modeling  the  ability  of  evaluators  in  a 
crowd  exists  from  the  psychometrics  community  under  Item 
Response  Theory  [11]  and  Rasch  Models  [12].  These  mod¬ 
els  have  been  applied  to  standardized  tests,  with  several  ex¬ 
tensions  to  include  hierarchical  structure  [13]  similar  to  this 
study’s  model.  More  recently,  the  human-computer  inter¬ 
action,  machine  learning,  and  crowdsourcing  communities 
have  modeled  the  ability  of  evaluators  in  a  crowdsourced 
evaluation  process  for  various  tasks.  These  tasks  are  typi¬ 
cally  “human  easy,  computer  hard,”  such  as  image  annota¬ 
tion  [14, 15],  planning  and  scheduling  [16],  and  natural  lan¬ 
guage  processing  [17, 18]. 

Many  of  these  models  are  qualitatively  similar,  with 
differences  in  the  treatment  of  evaluator  bias  [15,  19,  20], 
form  of  the  likelihood  function  (e.g.,  ordinal,  ranking,  bi¬ 
nary)  [21],  extent  to  which  the  true  score  is  known  [22],  and 
methods of scaling up to larger datasets [15, 23]. Our study
is qualitatively similar to this literature, with one key
difference: its application to an engineering design task
and the resulting distribution of ability in the crowd.

Specifically,  many  of  these  recent  crowdsourced  evalu¬ 
ation  tasks  have  a  majority  of  evaluators  with  the  ability  to 
give  an  accurate  evaluation  (e.g.,  how  many  animals  are  in 
this  image?)  [24].  As  a  result,  either  averaging  or  taking  a 
majority  vote  of  the  crowd’s  evaluators  is  often  already  quite 
accurate  [25].  For  these  cases,  ability  often  represents  the 
notion  of  task  consistency  and  attentiveness,  with  low-ability 
evaluators  being  more  spammy  or  malicious  [15]. 

In  contrast,  engineering  design  tasks  may  require  abil¬ 
ity  that  is  more  sparsely  scattered  amongst  the  crowd.  This 
is  supported  by  prior  industrial  applications  of  crowdsourced 
evaluation  for  engineering  design.  The  Fiat  Mio  was  a  fully 
crowdsourced  vehicle  design  concept,  yet  the  large  number 
of  low-ability  submissions  resulted  in  Fiat  using  its  design 
and  engineering  teams  as  a  filter  without  the  use  of  algorith¬ 
mic  assistance  [26].  Local  Motors  Incorporated  developed 
the  Rally  Fighter  using  a  crowdsourced  evaluation  system 
similar  to  this  study,  but  strongly  weighted  evaluations  of 
the  internal  design  team  [27].  For  these  engineering  design 
tasks,  the  notion  of  ability  instead  may  represent  specialized 
knowledge  and  heuristics  necessary  to  give  an  accurate  eval¬ 
uation. 

3  A  Bayesian  Network  Model  for  Human  Evaluations 

Let the crowdsourced evaluation contain D designs and
P evaluators. We denote the true score of design d as
$\Phi_d \in [0,1]$, and the evaluations from evaluator p for design d
as $R = \{r_{pd}\}$, where $r_{pd} \in [0,1]$. Each design d has an
evaluation difficulty $d_d$, and each evaluator p has an evaluation
ability $a_p$. We highlight four significant assumptions:
(1) We assume that evaluators evaluate designs
without systematic biases, i.e., given infinite chances of
evaluating one specific design, the average score of the
evaluators will converge to the true score of that design regardless
of their ability [3]; note that this assumption also implies that
no evaluators purposely give bad evaluations; (2) we assume
that evaluation responses are independent, i.e., the evaluation
on one design from one user will not be affected by the
evaluation made by that user for any other design, nor will it be
affected by the evaluation given by a different user; (3) we
assume that the ability of evaluators is constant during the
entire evaluation process; (4) we assume that all evaluators
are fully incentivized and do not exhibit fatigue. Without loss
of generality, we consider human evaluations real-valued in
the range of zero to one.

The evaluation $r_{pd}$ of evaluator p on design d is modeled as a random
variable following a truncated Gaussian distribution around
the true performance score $\Phi_d$, as detailed by Eq. (1) and
shown in Figure 2a.

$$ r_{pd} \sim \text{Truncated-Gaussian}\left(\Phi_d, \sigma_{pd}^2\right), \quad r_{pd} \in [0,1] \tag{1} $$

Fig. 2. (a) Low evaluation ability (dashed) relative to the design
evaluation difficulty results in an almost uniform distribution of an
evaluator’s evaluation response, while high evaluation ability (dotted)
results in evaluators making evaluations closer to the true score.
(b) An evaluator’s evaluation error variance $\sigma_{pd}^2$ as a function
of that evaluator’s ability $a_p$ given some fixed design difficulty $d_d$
and crowd-level parameters $\theta$ and $\gamma$.

The variance $\sigma_{pd}^2$ of this density is interpreted as the error an
evaluator makes when using his or her cognitive processes
while evaluating the design, and is described by a random
variable taking an Inverse-Gamma distribution:

$$ \sigma_{pd}^2 \sim \text{Inverse-Gamma}\left(\alpha_{pd}, \beta_{pd}\right) \tag{2} $$

The average evaluation error for a given evaluator on a
given design is a function of the evaluator’s ability $a_p$ and
the design’s difficulty $d_d$. In addition, this function is
sigmoidal to capture the notion that there exists a threshold of
necessary background knowledge to make an accurate
evaluation. Figure 2b illustrates this function. We set the first
requirement on the evaluator’s error random variable using
the expectation operator $\mathbb{E}$ in Eq. (3).

The random variables $\theta$ and $\gamma$ are introduced as model
parameters to allow more flexibility in modeling evaluation
tasks and are assumed to be the same for all evaluators and
designs: A high value of the scale parameter $\theta$ will sharply
bisect the crowd into good evaluators with negligible errors
and bad evaluators that evaluate almost randomly; the
location parameter $\gamma$ captures evaluation losses intrinsic to the
system, such as those stemming from the human-computer
interaction.

Next, the variance $\mathbb{V}$ of the evaluator error is considered
constant, capturing the notion that, while we hope the major
variability in the evaluation error to be captured by Equation
(3), other reasons exist to spread this error, represented by
constant C in Eq. (4).

$$ \mathbb{V}\left[\sigma_{pd}^2\right] = C \tag{4} $$

Following the requirements given by Eq. (3) and (4), we
reparameterize the Inverse-Gamma of Eq. (2) to obtain Eq.
(5) and (6), in which $\mathbb{E}\left[\sigma_{pd}^2\right]$ denotes the mean required
by Eq. (3), a sigmoid in $\theta(d_d - a_p) - \gamma$.

$$ \alpha_{pd} = \frac{1}{C}\left(\mathbb{E}\left[\sigma_{pd}^2\right]\right)^2 + 2 \tag{5} $$

$$ \beta_{pd} = \mathbb{E}\left[\sigma_{pd}^2\right]\left(\frac{1}{C}\left(\mathbb{E}\left[\sigma_{pd}^2\right]\right)^2 + 1\right) \tag{6} $$
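The reparameterization of Eq. (2) through Eq. (6) can be checked numerically. The following is a minimal Python sketch, assuming the standard Inverse-Gamma moments (mean $\beta/(\alpha-1)$ and variance $\beta^2/((\alpha-1)^2(\alpha-2))$ for $\alpha > 2$). The function names, the rejection sampler, and the numeric values are illustrative assumptions rather than the authors’ implementation, and the argument `mean_var` stands in for the sigmoidal expectation of Eq. (3), which is passed in directly instead of being computed.

```python
import math
import random


def inverse_gamma_params(mean_var, c):
    """Match Inverse-Gamma(alpha, beta) to a target mean and a constant variance c.

    Solves mean = beta / (alpha - 1) and
    variance = beta**2 / ((alpha - 1)**2 * (alpha - 2)) for alpha and beta.
    """
    alpha = mean_var ** 2 / c + 2.0
    beta = mean_var * (alpha - 1.0)
    return alpha, beta


def sample_evaluation(true_score, mean_var, c, rng):
    """Draw one evaluation in [0, 1] around true_score (hypothetical helper)."""
    alpha, beta = inverse_gamma_params(mean_var, c)
    # If G ~ Gamma(shape=alpha, scale=1), then beta / G ~ Inverse-Gamma(alpha, beta).
    sigma2 = beta / rng.gammavariate(alpha, 1.0)
    sigma = math.sqrt(sigma2)
    # Truncated Gaussian on [0, 1] by rejection sampling: simple, not the fastest.
    while True:
        r = rng.gauss(true_score, sigma)
        if 0.0 <= r <= 1.0:
            return r


if __name__ == "__main__":
    rng = random.Random(0)
    print(inverse_gamma_params(0.25, 0.01))
    print(sample_evaluation(0.7, 0.05, 0.001, rng))
```

Rejection sampling is used only because it is easy to verify; a production sampler would more likely invert the truncated Gaussian cumulative distribution.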


UNCLASSIFIED: Distribution Statement A. Approved for public release.

[Figure 3 panels: Case I: Homogeneous Crowds; Case II: Heterogeneous Crowds. Horizontal axis in both panels: Evaluator Ability $a_p$.]


Case  Type of Crowd        Varied Parameter           Figure  Number of Crowd Simulations
I     Homogeneous Crowd    Average Crowd Ability      4       250
II    Heterogeneous Crowd  Variance of Crowd Ability  5       250

Fig.  3.  Crowd  ability  distributions  for  Cases  I  and  II  that  test  how  the  abilities  of  evaluators  within  the  crowd  affect  evaluation  error  for 
homogeneous  and  heterogeneous  crowds,  respectively.  Three  possible  sample  crowds  are  shown  for  both  cases. 


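The hierarchical random variables of the evaluator’s
evaluation ability $a_p$ and the design’s evaluation difficulty
$d_d$ are both restricted to the range [0,1]. We let their
distributions be truncated Gaussians with parameters $\mu_a$, $\sigma_a$, $\mu_d$,
$\sigma_d$ set globally for all evaluators and designs as shown in Eq.
(7) and (8).

$$ a_p \sim \text{Truncated-Gaussian}\left(\mu_a, \sigma_a^2\right), \quad a_p \in [0,1] \tag{7} $$

$$ d_d \sim \text{Truncated-Gaussian}\left(\mu_d, \sigma_d^2\right), \quad d_d \in [0,1] \tag{8} $$

The probability densities over $\theta$ and $\gamma$ are assumed to be
Gaussian with parameters $\mu_\theta$, $\sigma_\theta$, $\mu_\gamma$, $\sigma_\gamma$ as shown in Eq. (9)
and (10).

$$ \theta \sim \text{Gaussian}\left(\mu_\theta, \sigma_\theta^2\right) \tag{9} $$

$$ \gamma \sim \text{Gaussian}\left(\mu_\gamma, \sigma_\gamma^2\right) \tag{10} $$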

Finally, by combining all random variables described in
this section, we obtain the joint probability density function
shown in Eq. (11). Note that all hyperparameters are
implicitly included.

$$ p(a, d, \Phi, R, \theta, \gamma) = p(\theta)\, p(\gamma) \prod_{p=1}^{P} p(a_p) \prod_{d=1}^{D} p(r_{pd} \mid a_p, d_d, \theta, \gamma, \Phi_d)\, p(d_d)\, p(\Phi_d) \tag{11} $$


4  Estimation  and  Inference  of  the  Bayesian  Network 

The proposed Bayesian network model is built upon
the following random variables: evaluators’ abilities $\{a_p\}$,
designs’ difficulties $\{d_d\}$, true scores of designs $\{\Phi_d\}$,
and parameters $\theta$, $\gamma$, $\sigma_a$, $\sigma_d$. This section
explains the settings for inferring the random variables and
estimating the parameters using the observed evaluations of
the evaluators $R = \{r_{pd}\}_{p=1,\ldots,P;\, d=1,\ldots,D}$.


Two  techniques  are  used  in  sequence.  Maximum  a 
posteriori  estimation  is  performed  using  Powell’s  conju¬ 
gate  direction  algorithm  [28],  a  derivative-free  optimization 
method,  to  get  initial  estimates  of  the  parameters  that  max¬ 
imize  Equation  (11).  These  point  estimates  are  then  used 
to  initiate  an  adaptive  Metropolis-Hastings  Markov  Chain 
Monte  Carlo  (MCMC)  algorithm  [29-31]  that  determines 
the  estimates  of  all  unknown  parameters  and  infers  posterior 
distributions  of  the  random  variables.  The  posterior  sample 
size  of  the  single-chained  MCMC  simulation  is  set  to  10^, 
thinned  by  a  factor  of  2,  with  the  first  half  discarded  as  burn- 
in. 
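The sampling stage can be illustrated with a much simpler sampler than the one used here. The sketch below is a non-adaptive, one-dimensional random-walk Metropolis-Hastings chain in Python, with the thinning and burn-in scheme described above (keep every second draw, discard the first half). It is a stand-in under stated assumptions: the target is a standard normal rather than the joint density of Eq. (11), the chain starts at the mode to mimic initialization from a point estimate, and the function name, step size, and iteration count are hypothetical.

```python
import math
import random


def metropolis_hastings(log_density, x0, n_iter, step, thin, rng):
    """Random-walk Metropolis-Hastings for a 1-D target, thinned, first half dropped."""
    x, log_p = x0, log_density(x0)
    chain = []
    for i in range(n_iter):
        proposal = x + rng.gauss(0.0, step)  # symmetric Gaussian proposal
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)).
        if rng.random() < math.exp(min(0.0, log_p_new - log_p)):
            x, log_p = proposal, log_p_new
        if i % thin == 0:  # thinning: keep every `thin`-th state
            chain.append(x)
    return chain[len(chain) // 2:]  # discard the first half as burn-in


def standard_normal_log_density(x):
    """Unnormalized log density of the toy target (an assumption of this sketch)."""
    return -0.5 * x * x


if __name__ == "__main__":
    rng = random.Random(1)
    samples = metropolis_hastings(standard_normal_log_density, 0.0, 10000, 1.0, 2, rng)
    print(len(samples), sum(samples) / len(samples))
```

An adaptive variant would additionally tune the proposal scale from the chain history, which this sketch deliberately omits.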

5  Simulated  Crowd  Study 

We  now  study  how  the  ability  distribution  of  the  crowd 
affects  the  crowdsourced  evaluation  process  using  Monte 
Carlo  simulations.  There  are  two  main  goals  of  this  study. 
First, we want to understand how crowds made up of different
mixtures of high- and low-ability evaluators affect the
crowd’s  combined  scores  of  designs  and  the  subsequent  eval¬ 
uation  error  from  the  true  scores  of  the  designs.  Second,  we 
want  to  understand  the  performance  differences  between  the 
Bayesian  network  and  Averaging.  Specifically  of  interest  are 
the  conditions  under  which  the  Bayesian  network  is  able  to 
find  the  subset  of  high-ability  evaluators  within  the  crowd  so 
that  it  can  give  greater  weight  to  their  responses. 

There are two crowd ability distribution cases we test, as
shown in Figure 3. Case I is that of a homogeneous crowd,
where all evaluators making up the crowd have similar abilities.
The varied parameter in the homogeneous case is the
average  ability  of  the  crowd,  thus  testing  the  question:  How 
well  can  a  crowd  perform  if  no  individual  evaluator  can  eval¬ 
uate  correctly  or,  alternatively,  if  every  evaluator  can  eval¬ 
uate  correctly?  Case  II  is  that  of  a  heterogeneous  crowd, 
where  the  crowd  is  made  up  of  a  mixture  of  high  and  low- 
ability  evaluators.  In  this  case  we  fix  the  average  ability  of 
the  crowd  to  be  low,  so  that  most  evaluators  cannot  evaluate 
designs correctly. Instead, the varied parameter in the
heterogeneous case is the variance of the crowd’s ability
distribution. This tests the question: How well can a crowd perform
as a function of its proportion of high-ability to low-ability
evaluators?

Fig. 4. Case I: Design evaluation error from the Averaging and the
Bayesian network methods as a function of average evaluator ability
for homogeneous crowds. This plot shows that, for homogeneous
crowds, the crowd’s combined score is insensitive to the method
used to combine the evaluator responses.

The  procedure  for  these  studies  is  to  use  the  Monte  Carlo 
simulation  environment  to:  (1)  Generate  a  crowd  made  up  of 
evaluators  with  abilities  drawn  from  the  tested  ability  distri¬ 
bution  (Case  I  or  II),  and  a  set  of  designs  with  true  scores 
unknown  to  the  crowd;  (2)  simulate  the  evaluation  process 
by  generating  a  rating  between  1  and  5  that  each  evalua¬ 
tor  within  the  crowd  gives  to  each  design;  (3)  combine  the 
evaluator-level  ratings  into  the  crowd’s  combined  score  for 
each  design  using  either  the  Bayesian  network  or  Averag¬ 
ing;  and  (4)  calculate  the  evaluation  error  between  the  true 
scores  of  the  designs  and  the  combined  scores  from  either 
the  Bayesian  network  or  Averaging. 

The simulation setup for these studies consisted of 60
evaluators per crowd, as well as eight designs with scores
drawn uniformly from the range [0,1] and evaluation
difficulties $\{d_d\}$ set at 0.5 for all designs. The evaluation
process for each evaluator is to rate all eight designs in the
continuous interval [1,5] according to a deterministic equation
given by the right-hand side of Equation (3), with the
location parameter $\gamma$ set at 0 and the scale parameter $\theta$ set at
0.1. After the crowd’s combined scores are obtained, either
by the Bayesian network or Averaging, the evaluation error
between the combined scores and the true scores is
calculated using the mean-squared error (MSE) metric as shown
in Equation (12), where $\hat{\Phi}_d$ denotes the combined score of
design d and D the number of designs.

$$ \mathrm{MSE} = \frac{1}{D} \sum_{d=1}^{D} \left(\hat{\Phi}_d - \Phi_d\right)^2 \tag{12} $$
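The baseline combination step and the error metric of Equation (12) are simple enough to state in a few lines. The following Python sketch assumes ratings are stored as a list of per-evaluator lists, `ratings[p][d]`; this layout, the function names, and the toy numbers are illustrative assumptions.

```python
def average_scores(ratings):
    """Combine ratings[p][d] into one score per design by equal-weight averaging."""
    n_evaluators = len(ratings)
    n_designs = len(ratings[0])
    return [sum(row[d] for row in ratings) / n_evaluators for d in range(n_designs)]


def mean_squared_error(estimated, true):
    """Equation (12): mean squared difference between combined and true scores."""
    return sum((e - t) ** 2 for e, t in zip(estimated, true)) / len(true)


if __name__ == "__main__":
    toy_ratings = [[1.0, 5.0], [3.0, 3.0]]   # two evaluators, two designs
    combined = average_scores(toy_ratings)    # [2.0, 4.0]
    print(mean_squared_error(combined, [2.0, 5.0]))  # 0.5
```

The same `mean_squared_error` call applies unchanged to scores estimated by any other combination method, which is what makes the two methods directly comparable.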

The results of Case I are shown in Figure 4. Each data
point represents a distinct simulated crowd with average
ability given on the x-axis, and associated design evaluation error
between the overall estimated score and the true scores on the
y-axis. All crowds in Case I were generated using the same
narrow crowd ability variance $\sigma_a^2 = 0.1$ to create
homogeneous crowds. The results show that if the average evaluator
ability is relatively high, both Averaging and the
Bayesian network perform equally well with small design
evaluation error. In contrast, when the average ability is
relatively low, neither Averaging nor the Bayesian network can
estimate the true scores very well.

Fig. 5. Case II: Design evaluation error over a set of designs for
a mixed crowd with low average evaluation ability, plotted against
the crowd variance of evaluator abilities. With increasing
crowd variance of ability there is an increasingly higher proportion
of high-ability evaluators present within the crowd. This leads to a
point where the Bayesian network is able to identify the cluster of
high-ability evaluators, upon which evaluation error drops to zero.

This  observation  agrees  with  intuition.  A  group  of  eval¬ 
uators  where  “no  one  has  the  ability”  to  evaluate  a  set  of  de¬ 
signs  should  not  collectively  have  the  ability  to  evaluate  a  set 
of  designs  just  by  changing  the  relative  weightings  of  evalu¬ 
ators  and  their  individual  evaluation  responses  upon  combi¬ 
nation  when  determining  the  crowd’s  combined  score.  Simi¬ 
larly,  a  group  of  evaluators  where  “everyone  has  the  ability” 
to  evaluate  a  set  of  designs  should  perform  well  regardless 
of  the  relative  weighting  between  evaluators.  The  key  result 
for  Case  I  is  this:  When  the  crowd  has  a  homogeneous  distri¬ 
bution  of  evaluator  abilities,  it  does  not  matter  what  weight¬ 
ing  scheme  one  assigns  between  various  evaluators  and  their 
evaluations;  the  Bayesian  network  and  Averaging  will  per¬ 
form  similarly  to  each  other. 

The results of Case II are shown in Figure 5. In contrast to
Case I, the distinct crowds represented by each data point have
the same average ability $\mu_a = 0.2$. Instead, moving right
along the x-axis designates crowds with increasingly higher
proportions of high-ability evaluators. In
this case, we observe that the Bayesian network performs
much better than Averaging after a certain point on the
x-axis: the point where a sufficient number of high-ability
evaluators is contained within the crowd. Under these conditions,
the Bayesian network identifies the small group of experts
from the less competent crowd and weights their evaluations
more heavily than the rest, thus leading to combined scores much
closer to the true scores of the designs. This improvement is
not present when the crowd does not have a sufficient
number of high-ability evaluators. When this
occurs, as is shown on the left side of the x-axis, the
situation of “no one has the ability” is recreated from Case I.

Fig. 6. (a) Boundary conditions for bracket strength evaluation; (b) the set of all eight bracket designs.

In summary, we have created simulated crowds to test
the influence of crowd ability on the crowdsourced
evaluation process. Two cases were tested, representing
homogeneous and heterogeneous ability distributions. Under the
modeling  assumptions  described  in  Section  3,  we  find  that: 
(1)  When  the  crowd  is  homogeneous,  it  does  not  matter 
what  weighting  scheme  is  used,  as  both  Averaging  and  the 
Bayesian  network  give  similar  results;  (2)  when  the  crowd 
is  heterogeneous,  the  Bayesian  network  is  able  to  output  the 
crowd’s  combined  score  much  closer  to  the  true  scores  under 
the  condition  that  a  sufficient  number  of  high-ability  evalua¬ 
tors  exist  within  the  crowd. 


6  Human  Crowd  Study 

In  this  section  we  set  up  a  design  evaluation  task  for  a 
real  human  crowd  to  test  our  modeling  assumptions.  The 
evaluation  task  was  chosen  to  be  a  classic  structural  design 
problem  for  a  load-bearing  bracket  [32],  in  which  evaluators 
are  asked  to  rate  the  capabilities  of  bracket  designs  to  carry  a 
vertical  load  as  shown  in  Figure  6. 

Participants 

The  human  crowd  consisted  of  181  evaluators  recruited 
using  the  crowdsourcing  platform  Amazon  Mechanical  Turk. 
For  the  bracket  designs,  eight  bracket  topologies  were  gen¬ 
erated  using  the  same  amount  of  raw  material.  The  deforma¬ 
tion  induced  by  tensile  stress  upon  vertical  loading  of  each 
bracket  was  calculated  in  OptiStruct  [33].  The  strength  of 
a  bracket  was  defined  as  the  amount  of  deformation  under  a 
common  load,  and  was  subsequently  scaled  linearly  between 
1  and  5  as  labeled  in  Figure  6.  The  scaled  strength  values 
were  considered  as  the  true  scores,  which  were  later  used  to 
calculate  evaluation  errors  from  the  estimations  from  either 
the  Bayesian  network  or  Averaging  methods. 

Procedure 

The evaluation process for each evaluator was as
follows: The eight bracket designs were first presented all
together to the evaluator, who was then asked to review these
designs to get an overall idea of their strengths. After at least
20 seconds, the evaluator was allowed to continue to the next stage
where  the  designs  were  presented  sequentially  and  in  random 
order.  For  each  design,  the  evaluator  was  asked  to  evaluate  its 
strength  using  a  rating  between  1  and  5,  with  1  being  “Very 
Weak”  and  5  “Very  Strong.”  To  gather  these  data,  a  web¬ 
site  with  a  database  backend  was  set  up  that  recorded  when 
an  evaluator  gave  an  evaluation  to  a  particular  bracket  de¬ 
sign  [34]. 

Data  analysis 

A preprocessing step was carried out before the data
were fed into either the Bayesian network or Averaging
techniques. Specifically, since some evaluators would give
ratings all above 3 while some others tended to give ratings all
around 3, each evaluator’s evaluations were linearly rescaled to
span the range 1-5. It should be noted that while this mapping
ensures that everyone gives ‘1’s and ‘5’s, it does not
remove nonlinear biases in between an evaluator’s most
extreme evaluations. To calculate design evaluation error, the
same mean-squared error metric was used as in the simulated
crowd study, as given in Equation (12).
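The rescaling step can be sketched as a per-evaluator min-max mapping. The Python function below is a minimal illustration, assuming each evaluator’s ratings are held in a list; the function name and the handling of an evaluator who gives the same rating to every design (mapped here to the midpoint) are assumptions, since that case is not discussed above.

```python
def rescale_ratings(ratings, low=1.0, high=5.0):
    """Linearly map one evaluator's ratings so the minimum -> low and maximum -> high."""
    r_min, r_max = min(ratings), max(ratings)
    if r_max == r_min:
        # Assumption: a flat rater carries no ordering information,
        # so every rating is placed at the midpoint of the range.
        return [(low + high) / 2.0 for _ in ratings]
    scale = (high - low) / (r_max - r_min)
    return [low + (r - r_min) * scale for r in ratings]


if __name__ == "__main__":
    print(rescale_ratings([3, 4, 5]))     # an evaluator who rated everything 3 or above
    print(rescale_ratings([2, 3, 4, 3]))  # an evaluator who stayed near 3
```

Because the mapping is linear, it stretches each evaluator’s range but preserves the relative spacing of that evaluator’s ratings, which is why nonlinear biases survive it.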

6.1  Results 

The  Bayesian  network  did  worse  than  Averaging  when 
estimating  the  true  scores  of  the  bracket  designs  as  shown  in 
Table  1. 


Method            Design Evaluation Error (std.)
Averaging         1.001 (N/A)
Bayesian Network  1.728 (0.006)

Table 1. Mean-squared evaluation error and standard deviation
from entire human crowd using Averaging and Bayesian network
estimation.

According  to  the  simulation  results,  the  Bayesian  net¬ 
work  can  only  do  worse  than  Averaging  if  it  is  not  able  to 
find  the  high-ability  evaluators,  or  experts,  in  the  crowd. 
This  could  happen  under  either  of  the  following  two  situa¬ 
tions:  (1)  The  modeling  assumption  made  in  Section  3  holds, 
namely,  that  low-ability  evaluators  are  less  consistent  (more 
random)  in  their  evaluations,  but  there  are  just  no  high-ability 
evaluators;  (2)  the  modeling  assumption  is  violated,  in  that 
there  exist  low-ability  evaluators  consistently  wrong  in  their 
evaluations. In the latter situation, the Bayesian network model
would mistakenly identify these individuals as having high
abilities due to their consistency and overweight their
incorrect evaluations.

Fig. 7. Clustering of evaluators based on how similar their
evaluations are across all eight designs. Each black or colored point
represents an individual evaluator, where colored points represent
evaluators who were similar to at least 3 other evaluators, and black
points represent evaluators who tended to evaluate more uniquely.

Visualizing  the  crowd’s  ability  distribution 

We now show that situation (2) above has occurred;
namely, there are indeed “consistently wrong” evaluators
in the human crowd. To show this, we cluster the
eight-dimensional human evaluation data to find clusters of
similar evaluators, and then flatten these clustered data to two
dimensions for visualization. This clustering finds groups
of evaluators who give consistent evaluations, regardless of
whether such evaluations are correct or incorrect. In other
words, members of a cluster were consistent in their
evaluations not necessarily with the right or wrong answer, but
with others in the cluster.

The clustering algorithm we used is density-based and uses the
Euclidean distance metric to identify clusters of evaluators
who gave similar evaluations [35]. This method was chosen
because it accommodates varying cluster sizes and does not
require every evaluator to belong to a cluster. The flattening
from eight dimensions to two was done using multidimensional
scaling.
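
As an illustration only, the density-based step can be sketched
in a few lines of Python. The text does not name the algorithm
behind [35], so this sketch assumes a DBSCAN-style procedure.
The function name density_cluster, the radius eps, and the toy
evaluation matrix are all hypothetical; min_neighbors=3 simply
mirrors the "similar to at least 3 other evaluators" wording of
the Figure 7 caption. The multidimensional scaling step is not
shown.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two evaluation vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def density_cluster(points, eps, min_neighbors):
    """DBSCAN-style clustering (assumed stand-in for the cited method).

    Returns one label per point; -1 means the point joined no cluster.
    """
    n = len(points)
    # Neighbors of i: all other points within distance eps.
    neighbors = [
        [j for j in range(n) if j != i and euclidean(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not yet visited
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_neighbors:
            labels[i] = -1  # provisionally unclustered
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_neighbors:
                frontier.extend(neighbors[j])
    return labels

# Hypothetical data: 9 evaluators rating 8 designs (values are invented).
evaluations = [
    [1, 2, 3, 4, 5, 1, 2, 3],
    [1, 2, 3, 4, 5, 1, 2, 4],
    [1, 2, 3, 4, 5, 1, 3, 3],
    [1, 2, 3, 4, 4, 1, 2, 3],
    [5, 5, 1, 1, 2, 4, 4, 1],
    [5, 5, 1, 1, 2, 4, 4, 2],
    [5, 5, 1, 1, 2, 4, 5, 1],
    [5, 4, 1, 1, 2, 4, 4, 1],
    [3, 1, 5, 2, 1, 5, 1, 5],
]
labels = density_cluster(evaluations, eps=1.5, min_neighbors=3)
```

Note that the procedure groups evaluators purely by mutual
agreement; the true scores never enter the computation, which
is why a cluster can be consistent and still wrong.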

We see in Figure 7 that five clusters of similar evaluators
were found; Table 2 gives the evaluation error of each cluster.
The cyan cluster is made up of high-ability “expert”
evaluators, as evidenced by its low evaluation error. In
contrast, the other four clusters were consistent but wrong in
their evaluations.

This  analysis  suggests  that  finding  high-ability  “expert” 
evaluators  through  an  open  call  is  possible  even  for  a  task 
like  structural  design,  in  which  ability  is  sparsely  distributed 
through  the  crowd.  However,  while  the  Bayesian  network  is 
a  theoretical  way  to  identify  these  evaluators,  its  application 
in  reality  is  limited  by  the  fact  that  there  exist  other  (more 
numerous)  clusters  of  evaluators  who  are  just  as  consistent 
yet  wrong  in  their  evaluations. 


Cluster Color       Design Evaluation Error
Blue                1.826
Cyan “Experts”      0.796
Red                 1.805
Green               2.394
Magenta             6.275

Table 2. Mean-squared evaluation errors from the 5 clusters of
similar evaluators.


6.2  Follow-up  to  human  crowd  study 

For completeness of the human study, we conducted two follow-up
experiments to capture the differences between the simulated
crowd assumptions and results and the human crowd results. The
first follow-up experiment augments the human crowd data with
simulated experts, in order to offset the “consistently wrong”
evaluators with a larger cluster of experts. The second
follow-up experiment remains entirely in simulation and shows
that enough “consistently wrong” evaluators will cause the
Bayesian network to fail in simulation as well, thus mimicking
the results of the human study.

6.2.1  Human  crowd  augmented  with  simulated  experts 

We show in Figure 8 how the design evaluation error would be
reduced if extra expert evaluations, i.e., evaluations exactly
equal to the true scores, were collected in addition to the
original 181 responses from the human study. In principle, the
error should decrease monotonically as the number of experts
increases; however, the stochastic nature of the Bayesian
network’s estimation process can produce sub-optimal estimates.
As in the simulations in Figure 5, one can observe a
phase-change phenomenon in the design evaluation error.
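
The augmentation itself is simple to sketch. The Python below
is a minimal illustration for the Averaging baseline only; it
does not implement the Bayesian network. The helper names, the
toy crowd, and the true scores are hypothetical, and the error
metric is assumed to be a mean-squared error over designs.

```python
def averaging_estimate(evaluations):
    """Column-wise mean: one estimated score per design."""
    n = len(evaluations)
    return [sum(column) / n for column in zip(*evaluations)]

def mean_squared_error(estimate, truth):
    """Mean over designs of the squared estimation error."""
    return sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)

def augment_with_experts(evaluations, truth, n_experts):
    """Append n_experts rows that equal the true scores exactly."""
    return evaluations + [list(truth) for _ in range(n_experts)]

# Hypothetical true scores and a small, mostly wrong crowd.
true_scores = [1.0, 2.0, 3.0, 4.0]
crowd = [
    [2.0, 2.0, 2.0, 2.0],
    [3.0, 3.0, 3.0, 3.0],
    [2.0, 3.0, 2.0, 3.0],
]

# Averaging error after adding 0, 5, 10, and 15 simulated experts.
errors = [
    mean_squared_error(
        averaging_estimate(augment_with_experts(crowd, true_scores, k)),
        true_scores,
    )
    for k in (0, 5, 10, 15)
]
```

For Averaging, each added expert pulls the mean toward the true
score, so the error shrinks smoothly with every addition; this
gives a reference against which an abrupt phase change stands
out.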


[Figure 8 plot. Title: Human Crowd Data Augmented with
Simulated Experts. Horizontal axis: Number of Added Experts.]

Fig. 8. Design evaluation error with respect to additional
experts.

6.2.2  Simulation  of  “consistently  wrong”  evaluators 

In this scenario, we ran a set of simulations in which the
crowd contained two clusters of evaluators. One cluster, “the
experts”, always evaluates correctly; the other cluster is
almost the same, except that its evaluators always rate one
design off by 0.5. We vary the crowd proportion of “experts”
from 0 to 1 and calculate the corresponding evaluation errors,
as shown in Figure 9. While the error from Averaging changes
linearly with the proportion, the error from the Bayesian
network exhibits only two phases. This result mimics what we
saw in the human study: the Bayesian network simply treats one
of the two groups as the experts and trusts its evaluations,
and that decision is made based on the group sizes.
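
The two-cluster setup can be sketched as follows. This is a
hedged illustration in Python, not the simulation code used in
the study: the function names, the crowd size, the true scores,
and the choice of the first design as the misrated one are all
assumptions. The trust_larger_group_error rule, with its
hypothetical threshold of 0.5, is only a crude stand-in for the
two-phase behaviour described above; it is not the Bayesian
network. Error is measured here as the largest absolute error
over designs.

```python
def simulate_crowd(n_evaluators, p_expert, true_scores, offset=0.5):
    """Two clusters: exact experts, and evaluators off by `offset` on design 0."""
    n_experts = round(n_evaluators * p_expert)
    biased = list(true_scores)
    biased[0] += offset  # assumption: the misrated design is the first one
    experts = [list(true_scores) for _ in range(n_experts)]
    others = [list(biased) for _ in range(n_evaluators - n_experts)]
    return experts + others

def averaging_error(crowd, true_scores):
    """Largest absolute error of the column-wise crowd mean."""
    n = len(crowd)
    estimate = [sum(column) / n for column in zip(*crowd)]
    return max(abs(e - t) for e, t in zip(estimate, true_scores))

def trust_larger_group_error(p_expert, offset=0.5, threshold=0.5):
    """Stand-in rule: adopt the larger group's scores wholesale."""
    return 0.0 if p_expert > threshold else offset

true_scores = [3.0, 1.0, 4.0, 2.0]  # hypothetical values
proportions = [0.0, 0.25, 0.5, 0.75, 1.0]
avg_errors = [
    averaging_error(simulate_crowd(100, p, true_scores), true_scores)
    for p in proportions
]
step_errors = [trust_larger_group_error(p) for p in proportions]
```

Under these assumptions the Averaging error falls in equal
steps from 0.5 to 0 as the expert proportion grows, while the
stand-in rule jumps between the two group values.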


[Figure 9 plot. Title: Simulated Crowd with "Consistently
Wrong" Evaluators.]

Fig. 9. Design evaluation error with respect to the proportion
of the expert group.

7  Conclusion 

Crowdsourcing is a promising method to evaluate engineering
design concepts that require human input, due to the
possibility of leveraging evaluation ability that is
distributed over a large number of people. A common
characteristic of crowdsourced design evaluation processes is
that the crowd is composed of a heterogeneous mixture of high-
and low-ability evaluators. A key challenge in such processes
is to find the subset of high-ability, or expert, evaluators in
the crowd so that their evaluations may be given more weight.

In this paper we proposed a Bayesian network to model human
evaluations. The key modeling assumption is that low-ability
evaluators tend to give less consistent (more random)
evaluations than expert evaluators. Using simulated crowds, we
tested how both Averaging and the Bayesian network are affected
by the distribution of evaluator abilities and showed that,
when the assumptions hold, the Bayesian network approach is
preferable to simple Averaging and requires fewer experts to
achieve a good estimate of the true design scores across all
simulation settings.

A human crowd study on bracket strength evaluation was then
conducted. Evaluators recruited through Amazon Mechanical Turk
evaluated eight bracket designs according to how strong the
brackets were under load. The result of this study was that the
Bayesian network model did worse than Averaging at estimating
the true strengths of the bracket designs. Upon further
investigation, it was found that there were numerous clusters
of “consistently wrong” evaluators in the crowd. These clusters
caused the Bayesian network to treat them as the experts and
consequently overweight their (wrong) evaluations.

While the human study did not showcase the superiority of the
Bayesian network over Averaging, it does reveal the challenges
of performing such crowdsourced evaluations even for a simple
engineering design task. The distribution of evaluation ability
in this study contrasts sharply with many of the recent
successes within the human-computer interaction, computer
vision, and crowdsourcing communities; namely, we show that
only a minority of the crowd are experts and that there exist
numerous clusters of consistent yet incorrect evaluators.

Further study into methods to find experts in settings in which
they are the minority is justified. These methods may
generalize our definition of evaluator ability by incorporating
relevant information about the evaluation process, as well as
establish analytic conditions under which it is impossible to
find experts. This study is thus a first step in showing that
extra information in the form of evaluator variables, design
variables, and task variables may be needed to find expert
evaluators for even simple engineering design tasks, as such
experts are otherwise overshadowed by the crowd.